Here’s an example of a piece of code that estimates π from its series approximation, up to a chosen number of iterations. A while loop is used to avoid allocating a very large sequence vector.
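As a sketch of what such a function might look like (the name `piestimate` comes from the timing call below, but the choice of the Leibniz series, π = 4(1 − 1/3 + 1/5 − …), is my assumption):

```r
# A sketch of piestimate(), assuming the Leibniz series
#   pi = 4 * (1 - 1/3 + 1/5 - 1/7 + ...).
# The while loop accumulates one term at a time, so we never
# allocate a length-n vector of terms.
piestimate <- function(iterations){
  total <- 0
  k <- 0
  while (k < iterations){
    total <- total + (-1)^k / (2 * k + 1)
    k <- k + 1
  }
  4 * total
}

piestimate(1e6)  # roughly 3.14159, within about 1e-6 of pi
```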
This was reasonably fast on my computer. Through a bit of trial and error, I found that it takes about 4 seconds to iterate 10 million times:
system.time(piestimate(1e7))
   user  system elapsed 
   3.54    0.21    3.99 
Now, suppose I wanted to run this for each of the next 10 values after 10 million; done one after another, that’ll take roughly 40 seconds.
Open up your resource monitor (Task Manager on Windows, Activity Monitor on macOS) and you’ll see a single processor at 100%, doing all the work: the jobs are executed one after another.
To have this done in parallel, we can run it explicitly on more cores. Again, watch how many CPUs sit at 100%. If you have more than 10 cores, only 10 will max out, because we only have 10 jobs.
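One way to do this explicitly is with the parallel package; a sketch below, where `slowjob` is a stand-in of my own for whatever slow function you are timing:

```r
library(parallel)

# stand-in for an expensive job such as the pi estimator
slowjob <- function(n){ Sys.sleep(0.2); n^2 }

# sequential: lapply(1:10, slowjob) runs the jobs one after another.
# parallel: spread the 10 jobs over up to 10 worker processes.
cl <- makeCluster(min(10, detectCores()))
answers <- parLapply(cl, 1:10, slowjob)
stopCluster(cl)
unlist(answers)  # 1 4 9 ... 100, computed in parallel
```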
Suppose that we have a function that takes multiple parameters. We can handle this in one of two ways. The first is to pass the parameters directly into the function: by default, the *apply functions pass a single argument (which can be a list or vector), so that’s often the easiest approach.
Consider this simulation, which finds the distribution of p-values when the null hypothesis is false.
ttestSim <- function(simulationparameters){
  numReps <- 100
  # simulationparameters is a list
  # slots are named sampleSize, trueMean, direction
  oneSimulation <- function(){
    randomData <- with(simulationparameters,
                       rnorm(n = sampleSize, mean = trueMean))
    ttestobj <- with(simulationparameters,
                     t.test(randomData, alternative = direction))
    with(ttestobj, p.value)
  }
  replicate(n = numReps, oneSimulation())
}
ttestSim(list(sampleSize = 10, trueMean = 0.5, direction = "two.sided"))
We can create a table that contains all the simulation scenarios we might be interested in, and then turn each row of that table into a list.
simScenarios <- expand.grid(sampleSize = c(5, 10, 25, 50),
                            trueMean = c(0, .25, .5, .75, 1),
                            direction = c("two.sided", "greater", "less"),
                            stringsAsFactors = FALSE)
# turn each row into its own list
scenariosList <- split(simScenarios, seq(nrow(simScenarios)))
We can quickly test that this works by running through the first three rows.
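A sketch of that quick check, assuming `ttestSim` and `scenariosList` from the chunks above are in the workspace:

```r
# run the first three scenarios sequentially as a sanity check
firstThree <- lapply(scenariosList[1:3], ttestSim)
sapply(firstThree, length)  # each scenario should yield 100 p-values
```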
When running in parallel and approximate job lengths are known, you usually won’t find anything faster than the combination of ordering jobs from longest to shortest and applying load balancing.
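The parl.* results plotted below came from an earlier chunk; a self-contained sketch of how such results could be produced follows. The `job` function here is hypothetical, built only to return the pid/start/end slots that the plotting code expects:

```r
library(parallel)

# hypothetical job: sleep for n/10 seconds, recording which worker ran it
job <- function(n){
  start <- Sys.time()
  Sys.sleep(n / 10)
  list(pid = Sys.getpid(), start = start, end = Sys.time())
}

cl <- makeCluster(2)
parl.answers_10_1   <- parLapply(cl, 10:1, job)    # prescheduled, longest first
parl.answers_10_1LB <- parLapplyLB(cl, 10:1, job)  # load balanced, longest first
stopCluster(cl)
```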
if (Sys.info()[["sysname"]] != "Windows"){
  # processor ID changes more often due to forking, but total runtimes are comparable
  mc.answers_1_10   <- mclapply(1:10, job, mc.preschedule = TRUE,  mc.cores = 2)
  mc.answers_10_1   <- mclapply(10:1, job, mc.preschedule = TRUE,  mc.cores = 2)
  mc.answers_1_10LB <- mclapply(1:10, job, mc.preschedule = FALSE, mc.cores = 2)
  mc.answers_10_1LB <- mclapply(10:1, job, mc.preschedule = FALSE, mc.cores = 2)
}
makeggplot <- function(answers, maintitle = ""){
  require(ggplot2)
  allpid   <- sapply(answers, \(singleResult) singleResult$pid)
  allstart <- sapply(answers, \(singleResult) singleResult$start)
  allend   <- sapply(answers, \(singleResult) singleResult$end)

  # --- 1. Create a data frame from the extracted vectors ---
  latesttime   <- max(allend)
  earliesttime <- min(allstart)
  job_results_df <- data.frame(
    job_id = factor(seq_along(answers)),
    pid    = factor(allpid),           # this will be the y-axis aesthetic
    start  = allstart - earliesttime,
    end    = allend - earliesttime
  )
  runtime <- round(latesttime - earliesttime, 1)

  # --- 2. Plot the Gantt chart with a single colour and vertical end markers ---
  ggplot(job_results_df, aes(x = start, xend = end, y = pid, yend = pid)) +
    geom_segment(linewidth = 4, color = "#1f78b4") +
    geom_segment(aes(x = end, xend = end,
                     y = as.numeric(pid) - 0.2, yend = as.numeric(pid) + 0.2),
                 linewidth = 1.5, color = "#e31a1c") +  # red marks each job's end
    labs(title = paste(maintitle, runtime, "seconds"),
         x = "Time",
         y = "Process ID (PID)") +
    theme(panel.grid.major.y = element_blank())  # remove redundant grid lines
}

p1.parl <- makeggplot(parl.answers_1_10,   "normal shortest to longest")
p2.parl <- makeggplot(parl.answers_10_1,   "normal longest to shortest")
p3.parl <- makeggplot(parl.answers_1_10LB, "balanced shortest to longest")
p4.parl <- makeggplot(parl.answers_10_1LB, "balanced longest to shortest")
require(patchwork)
(p1.parl + p2.parl) / (p3.parl + p4.parl)
if (Sys.info()[["sysname"]] != "Windows"){
  p1.mc <- makeggplot(mc.answers_1_10,   "normal shortest to longest")
  p2.mc <- makeggplot(mc.answers_10_1,   "normal longest to shortest")
  p3.mc <- makeggplot(mc.answers_1_10LB, "balanced shortest to longest")
  p4.mc <- makeggplot(mc.answers_10_1LB, "balanced longest to shortest")
  require(patchwork)
  (p1.mc + p2.mc) / (p3.mc + p4.mc)
}
Below is an example from parLapply on my server: [figure: parLapply sample]
Below is an example from mclapply on my server: [figure: mclapply sample]
These were both run on the same machine. I don’t know why the mclapply runtimes are so much more consistent (closer to 14 or 15 seconds, whereas the parLapply ones vary the way I would expect); in practice it generally doesn’t matter. When